CA682 Data Visualization Assignment

Bhavesh Bhagria

Student ID: 21262891

I am using three CSV files. The first is a coronavirus dataset that contains daily data for the different types of cases (confirmed, death, and recovered) for different countries, recorded since the start of the outbreak. The second contains daily data for the number of vaccine doses administered and the number of people partially and fully vaccinated. The third, which I obtained from a public source but unfortunately did not use, is the world population dataset, which contains the population of every country.

These CSV files were created by the Johns Hopkins University Center for Systems Science and Engineering and are available in a public GitHub repository, linked below. The data has been compiled from sources such as the World Health Organization (WHO), the Centers for Disease Control and Prevention (CDC), and the health ministries of multiple countries.

Dataset link: https://github.com/RamiKrispin/coronavirus

These datasets can be considered Big Data because they fulfil the Vs of Big Data, as outlined below.

Velocity: These datasets are updated daily and are available on a public GitHub repository.

Volume: The available datasets take more than 50GB of storage and include text, numbers, and date series. In this assignment, I will be using four different CSV files: Cumulative Confirmed Cases, Recovered Cases, Death Cases, and stats data. Each file contains the country name, province/state, latitude, longitude, and a date series (confirmed, recovered, or death counts respectively) from 22 January 2020 to date, updated daily.

Variety: The dataset is compiled from several different sources (see the dataset link above).

Data Exploration, Processing, Cleaning and/or Integration

The virus and vaccine datasets are large, and they contain features and values that need processing. For example, the virus dataset has a ‘type’ column with three unique values: confirmed, death, and recovered. I used the Pandas and NumPy libraries to manipulate, clean, and integrate both datasets.

Virus dataset: I checked the number of unique countries using the nunique() function. As there were a lot of unique countries, I decided to drop the country column, along with the province, lat, and long columns, using the drop() function. Next, I created three different datasets for the three types of cases and divided the data of the main dataset among them using Boolean masking. Finally, I used the groupby() function to sum the cases for each date, so that we have the total number of cases for each unique date.
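The first two steps (counting unique countries and dropping the location columns) can be sketched as follows. This is only a minimal illustration: the column names (date, province, country, lat, long, type, cases) are assumptions based on the description above, and the rows are made up.

```python
import pandas as pd

# Toy stand-in for the virus dataset; column names and values are assumed.
virus = pd.DataFrame({
    "date": ["2020-01-22", "2020-01-22", "2020-01-23", "2020-01-23"],
    "province": [None, "Hubei", None, "Hubei"],
    "country": ["Italy", "China", "Italy", "China"],
    "lat": [41.9, 30.9, 41.9, 30.9],
    "long": [12.6, 112.3, 12.6, 112.3],
    "type": ["confirmed", "confirmed", "confirmed", "death"],
    "cases": [0, 444, 0, 17],
})

# Count the distinct countries before deciding whether to keep the column.
n_countries = virus["country"].nunique()

# Drop the location columns that are not needed for a worldwide total.
virus_slim = virus.drop(columns=["country", "province", "lat", "long"])
```

After this step only the date, type, and cases columns remain, which is all the later grouping needs.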

Coronavirus Dataset

Covid Vaccine Dataset

Vaccine Dataset: I followed a similar approach of dropping unnecessary features, creating seven data frames for the seven continents, and using Boolean masking to transfer data from the main dataset to them. Here, I used a different function, pivot_table(), to sum the three vaccine features for each unique date.

World Population Dataset

Data Pre-processing

We need to pre-process the data (cleaning, grouping, and merging it as needed) to extract insights.

We will use the coronavirus data to create an interactive graph for confirmed, death, recovered, and active cases. The user chooses between these four types with buttons, and each is displayed as a scatter plot.

Virus Data Pre-processing.

As the datasets contain different numbers of unique countries, we need to do a lot of pre-processing on all of them to make sure the visualizations are accurate.

Let's start with the virus dataset.

We split the values of the main virus dataset into three different datasets, one per case type:

1) Confirmed Cases

2) Death Cases

3) Recovered Cases

Transferring the data from the virus dataset to the three different datasets.

We use Boolean masking to transfer data from the main dataset into the three datasets.
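Boolean masking here means comparing the type column against a value and using the resulting True/False Series to select rows. A minimal sketch, assuming the columns are named date, type, and cases (the rows are made up):

```python
import pandas as pd

# Toy stand-in for the slimmed-down virus dataset; names and values are assumed.
virus = pd.DataFrame({
    "date": ["2020-01-22", "2020-01-22", "2020-01-22", "2020-01-23", "2020-01-23"],
    "type": ["confirmed", "death", "recovered", "confirmed", "death"],
    "cases": [444, 17, 28, 99, 1],
})

# Each comparison yields a Boolean mask; indexing with it keeps the True rows.
confirmed = virus[virus["type"] == "confirmed"].copy()
death = virus[virus["type"] == "death"].copy()
recovered = virus[virus["type"] == "recovered"].copy()
```

The .copy() calls are a design choice so that later changes to one subset do not trigger pandas warnings about modifying a view of the main dataset.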

We sum up all the cases for each date using the groupby() function.

We use the groupby() function to group the rows by date and sum the cases for each date, so that we don't have multiple values for one date. Because summing a single grouped column returns a Series, we use the to_frame() function to convert the result back into a DataFrame.
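A minimal sketch of this grouping step, assuming a confirmed dataset with date and cases columns (values made up):

```python
import pandas as pd

# Toy confirmed-cases data with two rows for the same date; values are assumed.
confirmed = pd.DataFrame({
    "date": ["2020-01-22", "2020-01-22", "2020-01-23"],
    "cases": [444, 1, 99],
})

# Selecting one column from the grouped object and summing returns a Series
# indexed by date.
daily = confirmed.groupby("date")["cases"].sum()

# to_frame() turns that Series into a one-column DataFrame.
daily_df = daily.to_frame(name="confirmed")
```

Naming the column in to_frame() is optional; it is done here only so the three case types can be told apart if they are later combined.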

Active Cases Dataset.

We create a new dataset, Active, in which we calculate the active cases by subtracting the sum of recovered and death cases from the confirmed cases.
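The calculation can be sketched as below, assuming the three per-date totals share the same date index (the numbers are made up for illustration):

```python
import pandas as pd

# Toy per-date totals on a shared date index; values are assumed.
dates = pd.to_datetime(["2020-01-22", "2020-01-23", "2020-01-24"])
confirmed = pd.Series([10, 25, 40], index=dates)
death = pd.Series([1, 2, 3], index=dates)
recovered = pd.Series([0, 5, 12], index=dates)

# active = confirmed - (recovered + death), aligned on the date index.
active = (confirmed - (recovered + death)).to_frame(name="active")
```

Because pandas aligns on the index, a date missing from one of the inputs would produce NaN rather than a silently wrong number, so it is worth checking that all three cover the same dates.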

Virus Dataset Visualization

For the virus dataset, I wanted to simplify the visualization so that the audience can choose what they want to see. There are three types of cases in the virus dataset (confirmed, death, and recovered); the fourth, active cases, I created by subtracting the sum of recovered and death cases from confirmed cases.

My choice of graph for this dataset is the scatter plot. I chose scatter plots because these graphs represent the number of daily cases worldwide. Because the markers are spread across the plot area, they give a better sense of how densely or sparsely the values are clustered over time.

As an increase in the number of cases is not a good sign, I used the color scale parameter to show the severity of the situation: it uses darker colors for larger numerical values and lighter colors for smaller ones. Plotly displays a color scale bar on the right side of the window.

The functions that I have used are detailed below:

add_trace(): a Plotly figure method that adds a trace (a data series) to the figure. Its "hovertemplate" attribute defines the text shown for the x and y values when a marker is hovered over.

update_layout(): accepts the "updatemenus" attribute, which allows us to add menus to the figure. In this visualisation, I have set the menu type to "buttons" and the direction to "down".

update_xaxes(): updates the x-axis settings. I have set "showticklabels=False", which hides the x-axis tick labels.

update_layout(): also used to style the overall layout of the graph by adding axis labels, titles and so on.

rangeslider: an x-axis attribute (rather than a function) that lets the user select the range of dates for which they want to see the data.

rangeselector: an x-axis attribute that provides pre-programmed buttons with which users can directly select the time period they want to see.

Vaccine data pre-processing

For this dataset, we will display the vaccination records by continent.

We start by creating 7 dataframes, one per continent, and then transfer the data from the main dataset into them using Boolean masking. We then sum the vaccine figures for each date using the pivot_table() function.
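For a single continent, the masking and pivot-table steps might look like the sketch below. The column names (continent, doses_admin, people_partially_vaccinated, people_fully_vaccinated) are assumptions, and the rows are made up.

```python
import pandas as pd

# Toy stand-in for the vaccine dataset; column names and values are assumed.
vaccine = pd.DataFrame({
    "date": ["2021-03-01", "2021-03-01", "2021-03-01", "2021-03-02"],
    "country": ["France", "Germany", "Japan", "France"],
    "continent": ["Europe", "Europe", "Asia", "Europe"],
    "doses_admin": [100, 200, 50, 150],
    "people_partially_vaccinated": [80, 150, 40, 110],
    "people_fully_vaccinated": [20, 50, 10, 40],
})

features = ["doses_admin", "people_partially_vaccinated", "people_fully_vaccinated"]

# Boolean mask: keep only the rows for one continent.
europe = vaccine[vaccine["continent"] == "Europe"]

# One row per date, with each vaccine feature summed over the countries.
europe_daily = europe.pivot_table(index="date", values=features, aggfunc="sum")
```

The same two lines would be repeated (or put in a loop over the continent names) to build the other six data frames.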

Vaccine Dataset Visualization

For the vaccine dataset, I wanted to do something new with the choices that I offer to the customer. On many COVID-19 related websites, there is a list of countries from which customers choose the country whose virus statistics they want to view. There are also not many websites that display global vaccination statistics. Most customers who view the statistics don't want in-depth information for every country, so I have cleaned the dataset to display information not for 224 countries, but for only 7 continents.

My choice for these graphs is stacked bar charts. I chose a stacked bar chart for each continent because I am displaying three features of the vaccination status: the number of doses administered, the number of people partially vaccinated, and the number of people fully vaccinated. As with the previous graphs, I used the Plotly library, as it has many options for color coding and provides mechanisms for hover effects and interactivity. For interactivity, I created 7 buttons, one per continent; clicking a button displays the stacked bar chart for that continent.

I have used the built-in color scales in the Plotly library to show the scale of the situation: as the number of cases and the number of vaccinations increase, the colors get darker.

The functions that I have used are the same as those detailed for the virus dataset visualization above: add_trace() with the "hovertemplate" attribute, update_layout() with "updatemenus" (type "buttons", direction "down") and the layout titles, update_xaxes() with "showticklabels=False", and the rangeslider and rangeselector attributes of the x-axis.